A cost-effective and efficient agriculture management system can be built by
combining data analytics (DA), the internet of things (IoT), and cloud computing
(CC). Geographic information system (GIS) technology and remote sensing
predictions give users and stakeholders access to a variety of sensory data,
including rainfall patterns and weather-related information (such as pressure,
humidity, and temperature). These sensory data are in an unstructured format.
Current systems analyse such data poorly because they cannot effectively
balance speed and memory usage. An effective categorization model (ECM) for
agriculture management systems is proposed to address this research problem.
First, a classification technique called priority-based k-nearest neighbour
(KNN) is presented to categorize unstructured multi-dimensional data into a
structured form. Additionally, the Hadoop MapReduce (HMR) framework is used to
perform the classification in parallel. Experiments are conducted on data from
real-time IoT sensors used in agriculture. The proposed approach significantly
outperforms previous approaches in terms of computing time, memory efficiency,
model accuracy, and speedup.
1. INTRODUCTION 

To improve agricultural productivity, it is essential to update the system from time to time with data
such as yield, crop type, and crop growth conditions, along with rainfall patterns and weather-related
information (such as pressure, humidity, and temperature). The agro data captured by sensors are usually
unstructured and are moved to the cloud environment through a gateway or the internet. Smart agro farming
therefore needs an effective system for storing and analysing such unstructured data on cloud platforms.

This research addresses these issues and proposes the effective categorization model (ECM)
methodology. First, a priority-based k-nearest neighbour (KNN) algorithm is developed to categorise
unstructured, multi-dimensional (high-dimensional) data into a structured form. Additionally, a parallel
classification approach using the Hadoop MapReduce (HMR) architecture is provided. Figure 1 illustrates
the design of a quick and effective agro data classification algorithm for an agricultural management system.

The contributions of the proposed crop classification technique are as follows. First, a priority-based
classification system for multi-dimensional, high-dimensional, unstructured agro data is developed. Next, a
parallel classification approach using HMR is described. The proposed classification model can analyse
real-time agro sensory data with good accuracy, reduced time, higher memory efficiency, and speedup.

Big data techniques (machine learning and deep learning) are used in precision agriculture because they
can analyse enormous volumes of data and extract crucial information. Related work uses internet of things
(IoT) technology for intelligent agriculture to monitor environmental factors on a farm, and
three-dimensional cluster analysis (3D CA) to study the environmental factors impacting the farm.
Hyperspectral series of images or videos accelerate the rate and volume at which data are generated, which
poses challenges for big data, especially in agricultural remote sensing applications. Overviews of the IoT,
big data, and artificial intelligence (AI), and of how these technologies will impact the agri-food sector in
the future, are also available [1]-[4]. Other studies analyse the most recent research on the application of
intelligent data processing technologies in agriculture, particularly in rice production, and provide a
unified vision for IoT technology, data processing, and practical analytics in digital agriculture. Owing to
coronavirus disease-2019 (COVID-19), more people are now concerned about food safety, which benefits the
market share of smart agriculture. In contrast to existing solutions, a framework for integrating and
analysing agricultural data from various sources uses cloud computing (CC), which improves the solution's
scalability, flexibility, affordability, and maintainability [5]-[8].

Agriculture mobile crowd sensing (AMCS) has been thoroughly assessed, with recommendations for
approaches to agricultural data collection. Gaussian kernel regression has been used to estimate rice yield
from optical and synthetic aperture radar (SAR) imaging using a small quantity of ground truth data. A joint
federated learning (FL) model based on partial least squares (PLS) regression and neural networks (NN)
(FL-NNPLS) has been proposed, as has a high-resolution spatiotemporal image fusion approach (HISTIF) made up
of filtering for cross-scale spatial matching (FCSM) and multiplicative modulation of temporal change
(MMTC). The state of industrial agriculture and the takeaways from industrialized agricultural production
patterns have also been evaluated [9]-[12]. Further work proposes an image compression method for data
gathering and analyses how close a drone using a long range (LoRa) radio must fly to the sensors in order to
gather data at a given level of data quality [13]-[16].

A mechanism for automatically defining zones for variable rate application has been proposed, along
with an AI-enhanced embedded system that enables continuous analysis and on-site prediction of plant leaf
growth dynamics. Data visualization analysis together with cluster analysis has been used to identify the
significant technologies for advancing intelligent agriculture that can enhance production efficiency and
ensure the quality of agricultural yields [17], [18].


[Figure 1 labels: sensor device; agriculture field; collection of sensory information from sensors and
transmission to the base station; internet gateway; cloud storage; data preprocessing; data analysis; KNN
classification algorithm; Hadoop MapReduce framework]

Figure 1. Accurate classification model's architectural design for a multi-level cloud storage concept 
The paper is organized as follows. Section 2 presents the efficient classification methodology for
analysing raw unstructured data. Section 3 presents the experiments conducted to evaluate the accuracy of
the classification model. Section 4 concludes the research and outlines future work.


2. PRIORITY-BASED K-NEAREST NEIGHBOR CLASSIFICATION MODEL TO ANALYZE 
UNSTRUCTURED AGRO DATA 
This research provides a quick and effective classification algorithm for analysing unstructured
agricultural data and storing it at various cloud storage levels (providers). A priority-based KNN algorithm
is first introduced to classify unstructured agriculture-related data into structured data. A parallel
classification model using the HMR framework is then given to speed up the classification process for
relatively large data. Figure 2 shows the block architecture of the proposed classification model.
Figure 2. Block architecture of proposed classification model 


Crop-monitoring datasets gathered from [19] are used for analysis and categorization in this work.
The data consist of sensory readings acquired from various temperature, humidity, and gas sensors, and are
used to determine the conditions under which wine and banana fruits mature. The data comprise 11 attributes
or dimensions, including id, time, R1, R2, R3, R4, R5, R6, R7, and R8, as well as temperature and humidity,
and are made up of 919,438 data points dispersed over various locations and periods. The dataset is
described in full in [19]. We categorise these data using priority clustering with K set to 3 (i.e., we
consider three groups: not affected, averagely affected, and totally affected). K can be modified to meet
user categorization criteria. The data are thus separated into three groups and stored in the cloud.


2.1. Clustering model for classifying unstructured raw data into structured data 

The proposed priority-based KNN classification model is constructed by using k-means clustering to
divide the data points at each level into L distinct regions. After clustering, the same procedure is
applied recursively to the data points in each region. The recursion stops when a region contains fewer than
L data points. Algorithm 1 presents the proposed priority-based KNN model.


Algorithm 1. Building priority-based KNN algorithm 
Input: Agriculture dataset E, diverging influence L, maximum iterations J, center selection
strategy to be applied Ds.
Output: Structured (classified) data.
if |E| < L then
    build terminal node with feature points in E
else
    Q ← choose L data points from E using Ds
    converged ← false
    iterations ← 0
    while converged = false && iterations < J do
        D ← cluster the feature points in E around closest centers Q
        Qnew ← averages of clusters in D
        if Q = Qnew then
            converged ← true
        end if
        Q ← Qnew
        iterations ← iterations + 1
    end while
    for each cluster Di ∈ D do
        build non-terminal node with center Qi
        recursively apply the clustering method to the feature points in Di
    end for
end if
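The recursion in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `build_tree`, `leaf_points`, and `choose_centers` are hypothetical, points are plain tuples, the centre selection strategy is random sampling, and the guard that stops when a cluster fails to split is an addition to keep the sketch from recursing forever.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Leaf nodes hold points; inner nodes hold child nodes."""
    center: tuple = None
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise average of a non-empty list of tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def build_tree(E, L, J, choose_centers, center=None):
    """Recursive construction following the steps of Algorithm 1."""
    if len(E) < L:                              # fewer than L points: terminal node
        return Node(center=center, points=list(E))
    if J < 1:
        raise ValueError("J must be at least 1")
    Q = choose_centers(E, L)                    # initial centers (strategy Ds)
    for _ in range(J):
        D = [[] for _ in Q]
        for p in E:                             # assign each point to its closest center
            nearest = min(range(len(Q)), key=lambda i: sq_dist(p, Q[i]))
            D[nearest].append(p)
        # New centers are cluster averages; an empty cluster keeps its old center.
        Q_new = [mean(c) if c else Q[i] for i, c in enumerate(D)]
        converged = Q_new == Q
        Q = Q_new
        if converged:
            break
    node = Node(center=center)
    for cluster, q in zip(D, Q):
        if not cluster:
            continue
        if len(cluster) == len(E):              # no split happened (added guard): stop here
            node.children.append(Node(center=q, points=cluster))
        else:
            node.children.append(build_tree(cluster, L, J, choose_centers, q))
    return node


def leaf_points(node):
    """Collect every point stored in the leaves below a node."""
    if not node.children:
        return list(node.points)
    return [p for child in node.children for p in leaf_points(child)]


# Synthetic 2-D points stand in for the sensor records (assumption).
random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
tree = build_tree(data, L=3, J=10, choose_centers=lambda E, L: random.sample(E, L))
```

With L set to 3, the root has at most three children, which mirrors the three groups used in this work; every input point ends up in exactly one leaf.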


The diverging influence L is the number of clusters used when splitting the data at each node, and
choosing L is important for achieving a successful classification outcome. J, the maximum number of
clustering iterations, is another parameter of the priority-based KNN clustering method. Fewer iterations
speed up clustering at the expense of accuracy. Finally, the parameter Ds governs the selection of the
initial centres in the clustering algorithm. The proposed priority-based KNN clustering achieves good
convergence in minimal time. The raw input data used to perform classification is displayed in Figure 3.
Figure 3. Raw input dataset used for performing classification operation 


Figure 3 shows that the raw data is composed of 20-dimensional points, generated similarly to [19],
[20]. The computational complexity depends mainly on the dimension size rather than on the number of data
rows. Classification is carried out to identify the least affected (i.e., class a), averagely affected
(i.e., class b), and most affected (i.e., class c) data under the assumption described in Figure 4. The
outcome of the classification model is shown in Figure 5.
Figure 4. Classification input data for performing classification operation 
Figure 5. Classification outcome attained using priority-based KNN classification 


2.2. Parallelizing classification using HMR framework 

Additionally, this paper proposes a parallel classification system that uses the HMR framework
[20]. Figure 6 depicts the fundamental design of the HMR framework. Since HMR follows the execute-once
paradigm, all state data for iterative execution must be written to the distributed file system (DFS) and
then read back in for each stage of the algorithm's calculation or evaluation. HMR is a widely used,
publicly available (i.e., open-source) software model for MR computations.
Figure 6. The architecture of HMR framework 


2.3. Advantage of HMR 

Hadoop [21] is a distributed computing framework written in the Java programming language for
cloud computing environments; it supports the MR architecture shown in Figure 7. HMR follows the
execute-once paradigm, which implies that under an iterative execution strategy all state data must be
written to the DFS and then read back in at each step of the algorithm's calculation or evaluation. HMR is
a publicly available (i.e., open-source) software model that is broadly used for MR calculations. The work
of Owen et al. [22] was built to run on top of HMR and the Hadoop distributed file system [23], [24]. The
Hadoop distributed file system (HDFS) is an implementation of the Google file system (GFS) design, in which
a very large dataset is fragmented into small blocks of equal length and a duplicate copy of each block is
maintained (a process known as data replication). While processing the data, the framework pushes
computations to the virtual computing nodes that host these blocks, which increases data locality and
shortens the algorithm's makespan. When HMR is used with HDFS, HMR can exploit data locality and push
computations to the data they operate on, eliminating the network overhead that would otherwise be incurred
when collecting data from HDFS. This may give HMR-based implementations an edge in computing overhead
compared with other distributed and parallel processing architectures.
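The block fragmentation, replication, and locality ideas described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not HDFS code: the function names, the round-robin replica placement, and the worker names are all hypothetical and do not reflect the actual HDFS placement policy.

```python
def split_into_blocks(records, block_size):
    """Fragment a dataset into consecutive blocks of at most block_size records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def place_replicas(num_blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes in round-robin order.

    Assumes replication <= len(nodes); real HDFS uses a rack-aware policy instead.
    """
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def pick_local_node(block_id, placement, free_nodes):
    """Prefer a free node that already stores the block (data locality)."""
    for node in placement[block_id]:
        if node in free_nodes:
            return node
    return None  # no local node is free: the block would travel over the network


records = list(range(10))                      # stand-in for sensor records
blocks = split_into_blocks(records, block_size=4)
nodes = ["w1", "w2", "w3", "w4"]               # hypothetical worker names
placement = place_replicas(len(blocks), nodes)
local = pick_local_node(1, placement, free_nodes={"w3", "w4"})
```

Here block 1 is stored on `w2` and `w3`, so a task for that block is sent to `w3` when only `w3` and `w4` are free, and no data has to be moved.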


2.4. Parallel classification algorithm for Hadoop MapReduce framework 

HMR is a combination of two important functions known as map and reduce, as shown in Figure 7.
The map function takes input pairs from one domain and generates a list of key-value pairs in a different
domain, $M(key_1, val_1) \rightarrow l(key_2, val_2)$. All values that share the same intermediate key
$key_2$ are combined into a list and passed to the reducer function. The reducer function uses the
intermediate key $key_2$ and its values to create a new set of values $l(val_2)$.




Figure 7. HMR computation model 


In Hadoop, a MapReduce job is executed on multiple machines, one of which is the master node while
the others are worker nodes (also known as slave nodes). The master node distributes tasks among the worker
nodes. Each slave node has a fixed number of mapper and reducer functions, also called map and reduce slots.
Worker nodes periodically report their free and occupied map and reduce slots to the master node. The
master node schedules tasks based on the availability of map and reduce slots in the cluster.
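The slot-based scheduling described above can be sketched as a greedy assignment. This is an illustrative simplification, not Hadoop's scheduler: the function name `schedule`, the task and worker identifiers, and the alphabetical worker order are assumptions made for the example.

```python
from collections import deque


def schedule(tasks, free_slots):
    """Greedily assign pending tasks to workers that reported free slots.

    tasks: list of task ids; free_slots: dict mapping worker name -> free slot count.
    Returns (assignments, still_pending).
    """
    pending = deque(tasks)
    slots = dict(free_slots)                 # copy so the caller's report is untouched
    assignments = []
    for worker in sorted(slots):             # deterministic order for the example
        while slots[worker] > 0 and pending:
            assignments.append((pending.popleft(), worker))
            slots[worker] -= 1
    return assignments, list(pending)


# Two workers report three free map slots in total; four map tasks are waiting.
assigned, waiting = schedule(["m1", "m2", "m3", "m4"], {"w1": 2, "w2": 1})
```

Task `m4` stays pending until a worker reports a free slot in a later heartbeat.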

A MapReduce job combines map tasks and reduce tasks. The input dataset is divided into uniformly
sized blocks of data, which are then distributed among the nodes of the Hadoop cluster. Each map task
applies a user-defined mapper function to its input and produces intermediate output that serves as input
to the reduce tasks. The map stage consists of several independent map tasks. The reduce stage combines a
two-phase shuffle (shuffle and sort) with the reduce phase. The output of the completed map tasks is fed
into the shuffle phase, where it is shuffled and then sorted. The sorted data is then passed to the
user-defined reduce function, and the output is written back to HDFS.
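The map, shuffle/sort, and reduce stages can be imitated in a single process to make the data flow concrete. The sketch below is a hedged illustration, not Hadoop code: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the records are made-up (class label, temperature) pairs.

```python
from collections import defaultdict


def map_fn(record):
    """User-defined mapper: emit (class_label, temperature) for one record."""
    label, temperature = record
    yield label, temperature


def reduce_fn(key, values):
    """User-defined reducer: average all values that share one key."""
    return key, sum(values) / len(values)


def run_mapreduce(blocks, mapper, reducer):
    # Map stage: one map task per input block.
    intermediate = []
    for block in blocks:
        for record in block:
            intermediate.extend(mapper(record))
    # Shuffle/sort: group values by intermediate key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce stage: apply the reducer to each key and its list of values.
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Two input blocks of illustrative (class, temperature) records.
blocks = [[("a", 20.0), ("b", 30.0)], [("a", 22.0), ("c", 40.0)]]
result = run_mapreduce(blocks, map_fn, reduce_fn)
```

In a real cluster the map tasks run on different nodes and the final list is written to HDFS; here everything runs sequentially for clarity.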

The reduce stage is a combination of the shuffle/sort and reduce phases. The shuffle/sort phase
starts only after the first map task has completed, and it finishes after all map tasks have completed.
Once the shuffle/sort work is over, the reduce task starts. The shuffle phase result obtained in the first
cycle may differ from the result obtained in the second cycle, because the shuffle phase depends on the map
cycle. Shuffle phase measurement is therefore based on two reduce cycles, called the initial shuffle and
the typical shuffle. The reduce phase begins once the shuffle phase is finished. Further details on HMR
operation are provided in [20]. The distributed key building technique on the Hadoop HDInsight cluster is
given in Algorithm 2. This work uses a distributed architecture to classify agricultural data, and the
model achieves good accuracy, reduces computing time, and satisfies the real-time requirement, as
empirically demonstrated in the next section.


Algorithm 2. Building distributed key on Hadoop HDInsight cluster 
Input: Data E, keyVal Q
Output: ConstructKey(E, Q)
j ← MR_function()
Ej ← read chunk of the data E with respect to function j using Hadoop distributed file system
construct key in parallel on each worker with data Ej and keyVal Q
MR_Cumulate(Q) // Synchronize all workers.
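One possible reading of Algorithm 2 is sketched below in Python. It is an assumption-laden illustration rather than the HDInsight implementation: threads stand in for cluster workers, the "key" of a point is taken to be the index of its nearest centre in Q, and the names `construct_key`, `construct_key_parallel`, and `nearest_center` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest_center(point, Q):
    """Index of the center in Q closest to the point (squared Euclidean distance)."""
    return min(range(len(Q)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, Q[i])))


def construct_key(chunk, Q):
    """Worker task: key every point of one chunk by its nearest center."""
    return [(nearest_center(p, Q), p) for p in chunk]


def construct_key_parallel(E, Q, workers=4):
    """Split E into chunks, key each chunk in parallel, then merge the results."""
    size = max(1, -(-len(E) // workers))           # ceiling division
    chunks = [E[i:i + size] for i in range(0, len(E), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda c: construct_key(c, Q), chunks))
    # Cumulate step: runs only after every worker has finished (synchronization).
    keyed = {}
    for part in partial:
        for key, point in part:
            keyed.setdefault(key, []).append(point)
    return keyed


E = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 10.0)]
Q = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]       # illustrative centers
keyed = construct_key_parallel(E, Q)
```

Because `pool.map` returns chunk results in submission order, the merged lists keep the original ordering of the points within each key.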


3. RESULT AND DISCUSSION 

This section compares the proposed effective categorization model (ECM) approach with the existing
approach [25] and evaluates its performance in terms of speedup, accuracy, central processing unit (CPU)
time, and memory overhead. The dataset is used to determine how temperature and humidity influence the
effects of gases on wine and bananas. In general, distributing sensor devices around the agricultural area
improves yield. The sensors monitor conditions such as temperature and humidity and support decisions based
on them, such as whether to release water or apply pesticides. Additionally, monitoring the wind helps
predict the onset of rain, cyclones, and other weather events in a specific location with less delay, which
can improve agricultural production, so that the right decision can be made at the right moment with the
least harm to the crops. To assess memory and time efficiency on the real-time agro-sensor dataset obtained
from [19], such as Inspiral, this work is compared with the previous technique [26]. The experiments are
carried out on the Windows 10 operating system (OS) with a 64-bit Intel Core i7 processor, 16 GB of RAM, and
a dedicated 4 GB GPU with compute unified device architecture (CUDA) support. The HDInsight cluster is built
using Azure HDInsight (A3) with one master node and four slave worker nodes.

An experiment was carried out to compare the performance of ECM with the existing models [25],
[26] in terms of total CPU time, memory overhead, and accuracy attained in generating classification trees
for turning unstructured input into structured data. Table 1 compares ECM with several state-of-the-art
approaches for developing a classification tree. The results show that the artificial neural network (ANN)
performs better than the random categorization model, so we compare the performance improvement of the
proposed model against the ANN classification model. The ECM-local classification model improves accuracy
by 1.82% while decreasing overall CPU time and memory overhead by 32.85% and 55.07%, respectively. The
ECM-Hadoop classification model obtains a speedup of 16, improves accuracy by 1.82%, and decreases overall
CPU time and memory overhead by 95.86% and 84.05%, respectively. Additionally, we assessed how dimension
size affects classification performance, as shown in Figure 8. As listed in Table 2, we set the dimension
size to 4, 6, 8, and 10 and assessed the classification result in terms of total CPU time, accuracy, and
memory overhead. The results show that computation time and memory overhead increase as dimension size
rises. Similarly, accuracy increases from 0.983 at dimension size 4 to 0.989 at dimension size 10. This
makes it clear that dimension size affects classification accuracy. Overall, the results demonstrate the
ECM model's scalable performance in comparison with state-of-the-art models.
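The ECM-local percentages can be reproduced from the Table 1 values, assuming they are relative reductions with respect to the ANN baseline. The short Python check below is only a worked calculation under that assumption; the helper name `reduction_pct` is hypothetical.

```python
def reduction_pct(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Values taken from Table 1 (ANN baseline versus ECM-Local).
ann = {"cpu_s": 52.5, "mem_kb": 0.69}
ecm_local = {"cpu_s": 35.25, "mem_kb": 0.31}

cpu_reduction = reduction_pct(ann["cpu_s"], ecm_local["cpu_s"])    # about 32.86
mem_reduction = reduction_pct(ann["mem_kb"], ecm_local["mem_kb"])  # about 55.07
```

The CPU time figure evaluates to roughly 32.857%, which agrees with the reported 32.85% if the value is truncated rather than rounded.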


Table 1. Comparison with several state-of-the-art approaches for developing a classification tree
                               Random [7]   ANN [7]   ECM-Local   ECM-Hadoop
Total CPU time (s)             129.69       52.5      35.25       2.37
Average accuracy               0.977        0.971     0.989       0.989
Memory overhead (kilobytes)    0.71         0.69      0.31        0.11
Speedup                        14           14        2           16
Table 2. Classification performance assessment by considering different dimension size
Dimension size   Total CPU time (s)   Average accuracy   Memory overhead (kilobytes)
4                1.79                 0.983              0.09
6                1.86                 0.986              0.099
8                1.93                 0.987              0.106
10               2.17                 0.989              0.11
Average          1.9375               0.986              0.101
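The Average row of Table 2 and the trend across dimension sizes can be verified with a few lines of Python. This is only an arithmetic check on the tabulated values; the variable names are illustrative.

```python
# Table 2 rows: dimension size -> (total CPU time in s, average accuracy, memory overhead in KB)
rows = {
    4: (1.79, 0.983, 0.09),
    6: (1.86, 0.986, 0.099),
    8: (1.93, 0.987, 0.106),
    10: (2.17, 0.989, 0.11),
}

# Column-wise means over the four dimension sizes.
averages = tuple(sum(col) / len(rows) for col in zip(*rows.values()))

# Each metric should grow monotonically with dimension size.
ordered = [rows[d] for d in sorted(rows)]
monotonic = all(
    all(x < y for x, y in zip(prev, nxt)) for prev, nxt in zip(ordered, ordered[1:])
)
```

The means come out close to 1.9375, 0.986, and 0.101, and all three metrics rise with dimension size.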
Figure 8. Classification performance assessment considering different dimension size 


4. CONCLUSION 

This research establishes an efficient classification technique for analysing unstructured
agro-related data. A priority-based KNN classification model is presented, which analyses multi-dimensional
(high-dimensional) data. A distributed computing framework is adopted for the analysis: a parallel
clustering algorithm built on the Hadoop framework is developed to achieve scalable performance when
analysing high-dimensional data. All experiments are carried out on real-time data collected from agro
sensors. The results show that ECM-local reduces total CPU time and memory overhead by 32.85% and 55.07%,
respectively, and improves accuracy by 1.82%. Likewise, the ECM-Hadoop classification model decreases total
CPU time by 95.86% and memory overhead by 84.05%, improves accuracy by 1.82%, and raises the speedup to 16.
Overall, the results show the scalable performance of the developed ECM model compared with several
state-of-the-art approaches in terms of total CPU time, accuracy, memory efficiency, and speedup. Future
research will evaluate the model on different datasets and aim to minimize storage and processing cost.